Introduction to ggplot2: understanding the layers data, aes, and geom

Load the ggplot2 package

library(ggplot2)

Explore the mtcars data frame with str(). It is a good habit to inspect every dataset with str(), head(), and summary().

mtcars is a built-in dataset.

str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
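As a small illustration of the inspection habit mentioned above, head() and summary() complement str(). This is only a sketch using base R; the choice of three rows and of the mpg column is arbitrary.

```r
# mtcars ships with base R (package datasets), so nothing needs to be loaded
head(mtcars, 3)        # first three rows
summary(mtcars$mpg)    # quartiles, mean, min and max of one column
dim(mtcars)            # number of rows and columns
```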

Introducing the Grammar of Graphics

data, aesthetics, and geom layers

Variables used here: mpg (miles per gallon), wt (weight, in thousands of pounds), and disp (displacement of the car engine). The goal is to show several variables in a single plot.

A scatter plot has been made for you

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

Replace color with the correct column

ggplot(mtcars, aes(x = wt, y = mpg, color = disp)) +
  geom_point()

Replace disp with the correct column

ggplot(mtcars, aes(x = wt, y = mpg, size = disp)) +
  geom_point()

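Notice that a legend has been added beside the plot. Besides color and size, aes() also accepts shape, but shape cannot be used here because disp is a continuous variable.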
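To see shape working, it needs a discrete variable. The following is a minimal sketch, assuming cyl is a reasonable stand-in for disp; the object name p_shape is made up for this illustration.

```r
library(ggplot2)

# shape needs a discrete variable, so cyl is converted to a factor inside aes()
p_shape <- ggplot(mtcars, aes(x = wt, y = mpg, shape = factor(cyl))) +
  geom_point()
p_shape
```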

Overview of the ggplot2 grammar: data, aesthetics, geometries, qplot, statistics, coordinates and facets, themes (this session covers up to qplot)

Explore the diamonds data frame with str(). This is a good habit; head(diamonds) is a useful addition.

str(diamonds)
## Classes 'tbl_df', 'tbl' and 'data.frame':    53940 obs. of  10 variables:
##  $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

Add geom_point() with +; the plus sign is what stacks layers

ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point()

Add geom_point() and geom_smooth() with +; each + adds one more layer

ggplot(diamonds, aes(x = carat, y = price))+
  geom_point()+
  geom_smooth()

2 - show only the smooth line

ggplot(diamonds, aes(x = carat, y = price)) +
  geom_smooth()

3 - assign the correct value to color in aes()

ggplot(diamonds, aes(x = carat, y = price, color = clarity)) +
  geom_smooth()

4 - Keep the color settings from previous command. Plot only the points with argument alpha.

alpha controls transparency

ggplot(diamonds, aes(x = carat, y = price, color = clarity)) +
  geom_point(alpha = 0.4) 

Create the object containing the data and aes layers: dia_plot

Remember variables? A variable is a container: store something once and call it up whenever it is needed.

dia_plot <- ggplot(diamonds, aes(x = carat, y = price))

Add a geom layer with + and geom_point()

dia_plot + geom_point()

Add the same geom layer, but with aes() inside

dia_plot + geom_point(aes(color = clarity))
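Where aes() is placed matters once a second layer is added: aesthetics given in ggplot() are inherited by every layer, while aesthetics given inside a geom apply to that layer only. The sketch below illustrates this; the names base_plot, p_local and p_global, and the use of method = "lm", are choices made for the illustration rather than part of the exercise.

```r
library(ggplot2)

base_plot <- ggplot(diamonds, aes(x = carat, y = price))

# colour mapped only inside geom_point(): the smooth layer sees a single group
p_local <- base_plot +
  geom_point(aes(color = clarity), alpha = 0.4) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# colour mapped in ggplot(): every layer inherits it, one line per clarity level
p_global <- ggplot(diamonds, aes(x = carat, y = price, color = clarity)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
```

Printing p_local draws one fitted line through all points, whereas p_global draws a separate line for each clarity level.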

DATA

Understanding the difference between base graphics and ggplot2

Plot the correct variables of mtcars

plot(mtcars$wt, mtcars$mpg, col = mtcars$cyl)

Change cyl inside mtcars to a factor

mtcars$fcyl <- as.factor(mtcars$cyl)

Make the same plot as in the first instruction

plot(mtcars$wt, mtcars$mpg, col = mtcars$fcyl)

Describe the task; this is how base graphics handles it

Use lm() to calculate a linear model and save it as carModel

carModel <- lm(mpg ~ wt, data = mtcars)

Basic plot

mtcars$cyl <- as.factor(mtcars$cyl)
plot(mtcars$wt, mtcars$mpg, col = mtcars$cyl)
# Call abline() with carModel as the first argument and set lty to 2
abline(carModel, lty = 2)
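abline() can take the model object directly because it reads the intercept and slope from the fitted coefficients. A minimal sketch of that equivalence follows; the variable name coefs is introduced here only for illustration.

```r
# Fit the same model as above (mtcars ships with base R)
carModel <- lm(mpg ~ wt, data = mtcars)

# The two numbers abline() needs: intercept and slope
coefs <- coef(carModel)
coefs

# Equivalent call that passes intercept (a) and slope (b) explicitly
plot(mtcars$wt, mtcars$mpg)
abline(a = coefs[["(Intercept)"]], b = coefs[["wt"]], lty = 2)
```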

Plot each subset efficiently with lapply() (you don't have to edit this code)

plot(mtcars$wt, mtcars$mpg, col = mtcars$cyl)
# Iterate over the distinct cylinder groups only, one fitted line per group;
# invisible() hides the list of NULLs that lapply() would otherwise print
invisible(lapply(unique(mtcars$cyl), function(x) {
  abline(lm(mpg ~ wt, mtcars, subset = (cyl == x)), col = x)
}))
# This code draws the legend of the plot; it cannot be run on its own (a plot must already be open)
legend(x = 5, y = 33, legend = levels(mtcars$cyl),
       col = 1:3, pch = 1, bty = "n")

Now see how ggplot2 does the same thing.

Convert cyl to factor (don’t need to change)

mtcars$cyl <- as.factor(mtcars$cyl)

Plot 1: add geom_point() to this command to create a scatter plot

ggplot(mtcars, aes(x = wt, y = mpg, col = cyl)) +
  geom_point()

Plot 2: include the lines of the linear models, per cyl

ggplot(mtcars, aes(x = wt, y = mpg, col = cyl)) +
  geom_point()+ 
  geom_smooth(method = "lm", se = FALSE)

Plot 3: include a lm for the entire dataset in its whole

ggplot(mtcars, aes(x = wt, y = mpg, col = cyl)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  geom_smooth(aes(group = 1), method = "lm", se = FALSE, linetype = 2) 

The group aesthetic tells ggplot() to draw a single linear model through all the points. Compared with the base graphics version, the difference in effort is obvious.

Compare two ways of making the plot:

Option 1

ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point() +
  geom_point(aes(x = Petal.Length, y = Petal.Width), col = "red")

Option 2: this approach is better, but the iris.wide data does not exist yet, so we have to reshape it ourselves

ggplot(iris.wide, aes(x = Length, y = Width, col = Part)) + geom_point()

Tidy Data: why tidy data matters and why we reshape this way

Create iris.tidy

library(tidyverse)
iris.tidy <- iris %>% 
  gather(pm,Value,-Species) %>% 
  separate(pm,c("Part","Measure"),sep = "\\.") 

If the %>% operator is unclear, here are the same steps one at a time

iris.tidy_g <- gather(iris,pm,Value,-Species)
iris.tidy <- separate(iris.tidy_g,pm,c("Part","Measure"),sep = "\\.")

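Create iris.wide. Looking at head() of the target data shows two value columns (Length and Width), so it should be the result of a spread().

Let's try calling spread() on iris.tidy (a good chance to demonstrate the tidyr functions).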

iris.wide <- iris.tidy %>% spread(Measure, Value) # fails because the remaining columns do not uniquely identify each row

The solution:

Add a column with unique ids, then check the data with head()

iris$Flower <- 1:nrow(iris)

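First reshape wide to long with gather(), then split the key column with separate(), reshape long to wide with spread(), and finally select the columns that are needed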

iris.wide <- iris %>%
  gather(key, value, -Species, -Flower) %>%
  separate(key, c("Part","Measure"), "\\.") %>%
  spread(Measure, value) %>% 
  select(-Flower)
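In recent tidyr releases, gather() and spread() are superseded by pivot_longer() and pivot_wider(). Assuming tidyr 1.0 or later is installed, the same reshaping can be sketched as follows; iris_id and iris_wide2 are names chosen for this illustration, and the code works on a fresh copy of the built-in iris data.

```r
library(dplyr)
library(tidyr)

# Work on a copy with an id column so each flower stays identifiable
iris_id <- datasets::iris %>% mutate(Flower = row_number())

iris_wide2 <- iris_id %>%
  # wide to long, splitting "Sepal.Length" into Part = "Sepal", Measure = "Length"
  pivot_longer(cols = -c(Species, Flower),
               names_to = c("Part", "Measure"),
               names_sep = "\\.") %>%
  # long to wide again, one column per Measure
  pivot_wider(names_from = Measure, values_from = value) %>%
  select(-Flower)

head(iris_wide2)
```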

Restore iris to its original form (optional). The same reshaping is first repeated step by step, without %>%.

iris.tidy_g <- gather(iris,pm,value,-Species,-Flower)
iris.tidy_s <- separate(iris.tidy_g,pm,c("Part","Measure"),sep = "\\.")
iris.tidy_sp <- spread(iris.tidy_s,Measure,value)
iris.wide <- select(iris.tidy_sp,-Flower)
iris <- select(iris,-Flower)

Now on to the plotting.

Consider the structure of iris, iris.wide and iris.tidy (in that order)

str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
str(iris.wide)
## 'data.frame':    300 obs. of  4 variables:
##  $ Species: Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Part   : chr  "Petal" "Sepal" "Petal" "Sepal" ...
##  $ Length : num  1.4 5.1 1.4 4.9 1.3 4.7 1.5 4.6 1.4 5 ...
##  $ Width  : num  0.2 3.5 0.2 3 0.2 3.2 0.2 3.1 0.2 3.6 ...
str(iris.tidy)
## 'data.frame':    600 obs. of  4 variables:
##  $ Species: Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Part   : chr  "Sepal" "Sepal" "Sepal" "Sepal" ...
##  $ Measure: chr  "Length" "Length" "Length" "Length" ...
##  $ Value  : num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...

The target plot for now: think about which dataset produces it, and get a first feel for facets

Fill in the ___ to produce the plot given to the right

ggplot(iris.tidy, aes(x = Species, y = Value, col = Part)) +
  geom_jitter() +
  facet_grid(.~Measure)

The 3 data frames (iris, iris.wide and iris.tidy) are available in your environment

Execute head() on iris, iris.wide and iris.tidy (in that order)

head(iris)
head(iris.wide)
head(iris.tidy)

Think about which dataset you would use to get the plot shown

Fill in the ___ to produce the plot given to the right

ggplot(iris.wide, aes(x = Length, y = Width, color = Part)) +
  geom_jitter() +
  facet_grid(. ~ Species)

Try the plot without facets to see the difference (if facets feel like too much too soon, set them aside for now).

Aesthetics: the visual properties of a plot

Inside a geom, these properties are called attributes.

All about aesthetics

1 - Map mpg to x and cyl to y

ggplot(mtcars, aes(x = mpg, y = cyl)) +
  geom_point()

2 - Reverse: Map cyl to x and mpg to y

ggplot(mtcars, aes(x= cyl, y = mpg)) +
  geom_point()

3 - Map wt to x, mpg to y and cyl to col

ggplot(mtcars, aes(x = wt, y = mpg, col = cyl)) +
  geom_point()

4 - Change shape and size of the points in the above plot

ggplot(mtcars, aes(x =wt, y = mpg, col = cyl)) +
  geom_point(shape = 1, size = 4)

The difference between color and fill: geom_point() is an exception, where color is normally used instead of fill.

1. The default geom_point() uses shape = 19 (a solid circle with an outline the same colour as the inside).
2. Good alternatives are shape = 1 (hollow) and shape = 16 (solid, no outline).
3. A really nice alternative is shape = 21, which allows you to use both fill for the inside and col for the outline!
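The four shapes mentioned above can be compared side by side. This is an illustrative sketch only: the data frame shapes, the object p_shapes, and the colours are made up for the demonstration.

```r
library(ggplot2)

# One row per shape discussed above
shapes <- data.frame(x = 1:4, shape = c(19, 1, 16, 21))

p_shapes <- ggplot(shapes, aes(x = x, y = 0, shape = shape)) +
  # fill is only visible on shape 21; colour draws the outline or the whole glyph
  geom_point(size = 8, colour = "black", fill = "skyblue") +
  scale_shape_identity() +   # use the numbers directly as shape codes
  scale_x_continuous(breaks = 1:4, labels = as.character(shapes$shape))
p_shapes
```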

am and cyl should be factors and wt numeric; check their classes first

class(mtcars$am)
## [1] "numeric"
class(mtcars$cyl)
## [1] "factor"
class(mtcars$wt)
## [1] "numeric"

Convert them to factors. Skipping the conversion also works, but the resulting plot is noticeably uglier.

mtcars$am <- as.factor(mtcars$am)
mtcars$cyl <- as.factor(mtcars$cyl)
class(mtcars$am)
## [1] "factor"
class(mtcars$cyl)
## [1] "factor"
class(mtcars$wt)
## [1] "numeric"

1.From the previous exercise

ggplot(mtcars, aes(x = wt, y = mpg, col = cyl)) +
  geom_point(shape = 1, size = 4)

2 - Map cyl to fill. Shape 1 is hollow, so the fill has no visible effect

ggplot(mtcars, aes(x = wt, y = mpg, fill = cyl)) +
  geom_point(shape = 1, size = 4)

3 - Change shape and alpha of the points in the above plot. Since shape 1 cannot be filled, switch to shape 21

ggplot(mtcars, aes(x = wt, y = mpg, fill = cyl)) +
  geom_point(shape = 21, size = 4,alpha = 0.6)

4 - Map am to col in the above plot, adding an outline colour (col) on top of the fill

ggplot(mtcars, aes(x = wt, y = mpg, fill = cyl,col = am)) +
  geom_point(shape = 21, size = 4,alpha = 0.6)

How to choose the encoding:

Map cyl to size

ggplot(mtcars, aes(x = wt, y= mpg, size = cyl)) +
  geom_point()
## Warning: Using size for a discrete variable is not advised.

Map cyl to alpha

ggplot(mtcars, aes(x = wt, y= mpg,alpha = cyl)) +
  geom_point()

Map cyl to shape

ggplot(mtcars, aes(x = wt, y= mpg,shape = cyl)) +
  geom_point()

Map cyl to label

ggplot(mtcars, aes(x = wt, y= mpg,label = cyl)) +
  geom_text()

About attributes

Shapes in R can have a value from 0-25. Shapes 0-20 can only accept a color aesthetic, but shapes 21-25 have both a color and a fill aesthetic.

Define a hexadecimal color

my_color <- "#4ABEFF"

1 - First scatter plot, with col aesthetic:

map wt onto x, mpg onto y and cyl onto color:

ggplot(mtcars, aes(x = wt, y = mpg, color = cyl)) +
  geom_point()

2 - Plot 1, but set col attributes in geom layer. Notice how the attribute overrides the mapped aesthetic

Overwrite the color of the points inside geom_point() to my_color

ggplot(mtcars, aes(x = wt, y = mpg, color = cyl)) +
  geom_point(color = my_color)

3 - Plot 2, with fill instead of col aesthetic, plus shape and size attributes in geom layer.

map cyl to fill instead of col and set the attributes size to 10, shape to 23 and color to my_color inside geom_point()

Not every shape makes use of fill

ggplot(mtcars, aes(x = wt, y = mpg, fill = cyl)) +
  geom_point(size = 10, shape = 23, color = my_color)

One more round, as a review

Expand to draw points with alpha 0.5. Here fill has no effect on the default shape; try changing it to col

ggplot(mtcars, aes(x = wt, y = mpg, fill = cyl)) +
  geom_point(alpha = 0.5)

Expand to draw points with shape 24 and color yellow

ggplot(mtcars, aes(x = wt, y = mpg, fill = cyl)) +
  geom_point(shape = 24, color = "yellow")

Expand to draw text with label rownames(mtcars) and color red

ggplot(mtcars, aes(x = wt, y = mpg, fill = cyl)) +
  geom_text(label = rownames(mtcars), color = "red")

Going all out:

mpg – Miles/(US) gallon
cyl – Number of cylinders
disp – Displacement (cu.in.)
hp – Gross horsepower
drat – Rear axle ratio
wt – Weight (lb/1000)
qsec – 1/4 mile time
vs – V/S engine (V-shaped or straight)
am – Transmission (0 = automatic, 1 = manual)
gear – Number of forward gears
carb – Number of carburetors

More is not always better. Keep aesthetics and attributes apart: variables in a data frame are mapped to aesthetics in aes() (e.g. aes(col = cyl)) within ggplot(), while fixed visual elements are set as attributes in specific geom layers (e.g. geom_point(col = "red")).
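The distinction between mapping and setting can be sketched with three small plots. The object names p_mapped, p_set and p_slip are hypothetical and only serve this illustration.

```r
library(ggplot2)

# Mapping: colour depends on the data, so it goes inside aes()
p_mapped <- ggplot(mtcars, aes(x = wt, y = mpg, col = factor(cyl))) +
  geom_point()

# Setting: one fixed colour for every point, outside aes()
p_set <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(col = "red")

# Common slip: a constant inside aes() is treated as data, so the points
# get the default palette colour and a legend entry labelled "red"
p_slip <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(col = "red"))
```

Only p_set actually draws red points; p_slip maps the string "red" through the colour scale like any other category.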

Map mpg onto x, qsec onto y and factor(cyl) onto col (3 aesthetics):

ggplot(mtcars, aes(x = mpg, y = qsec, col = factor(cyl))) +
  geom_point()

Add mapping: factor(am) onto shape (now 4 aesthetics):

ggplot(mtcars, aes(x = mpg, y = qsec, col = factor(cyl), shape = factor(am))) +
  geom_point()

Add mapping: (hp/wt) onto size (now 5 aesthetics, which is quite impressive):

ggplot(mtcars, aes(x = mpg, y = qsec, col = factor(cyl), shape = factor(am), size = (hp/wt))) +
  geom_point()

Modifying aesthetics: position, jitter, and scale functions

The base layer, cyl.am, is available for you

cyl.am <- ggplot(mtcars, aes(x = factor(cyl), fill = factor(am)))

Add geom (position = "stack" by default)

cyl.am + 
  geom_bar()

Fill - show proportion

cyl.am + 
  geom_bar(position = "fill") 

Dodging - principles of similarity and proximity

cyl.am +
  geom_bar(position = "dodge")
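The numbers behind these three bar layouts can be checked in base R. The following is a sketch only, with counts and props as illustrative names: stack and dodge display the raw counts, while position = "fill" displays the row-wise proportions.

```r
# Counts behind the stacked and dodged bars: rows are cyl, columns are am
counts <- table(cyl = mtcars$cyl, am = mtcars$am)
counts

# position = "fill" rescales each bar to height 1, i.e. row-wise proportions
props <- prop.table(counts, margin = 1)
round(props, 2)
```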

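Clean up the axes with scale_ functions and polish the legend

# factor(am) has levels "0" (automatic) and "1" (manual), so the labels follow that order
val <- c("#E41A1C", "#377EB8")
lab <- c("Automatic", "Manual")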
cyl.am +
  geom_bar(position = "dodge") +
  scale_x_discrete("Cylinders") + 
  scale_y_continuous("Number") +
  scale_fill_manual("Transmission", 
                    values = val,
                    labels = lab)
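Because scale labels are matched to factor levels by position, it is easy to attach them in the wrong order. One way to avoid that, sketched here as an alternative rather than the course's method, is to put the labels into the data itself; cars_labelled and p_bar are names invented for this example.

```r
library(ggplot2)

# Copy of the data with am recoded; per ?mtcars, 0 = automatic, 1 = manual
cars_labelled <- transform(
  datasets::mtcars,
  am = factor(am, levels = c(0, 1), labels = c("Automatic", "Manual"))
)

p_bar <- ggplot(cars_labelled, aes(x = factor(cyl), fill = am)) +
  geom_bar(position = "dodge") +
  scale_x_discrete("Cylinders") +
  scale_y_continuous("Number") +
  scale_fill_manual("Transmission", values = c("#E41A1C", "#377EB8"))
p_bar
```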

Setting a dummy aesthetic (it is enough to understand the idea)

1 - Create jittered plot of mtcars, mpg onto x, 0 onto y

ggplot(mtcars, aes(x = mpg, y =0)) + geom_jitter() + geom_point()

2 - Add function to change y axis limits

ggplot(mtcars, aes(x = mpg, y =0)) + geom_jitter() + geom_point()+ scale_y_continuous(limits = c(-2,2))

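How to choose a plot type, and how to deal with overplotting

A plot exists to make the data easier to read, not to look pretty; the more decoration is piled on, the worse it gets.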

Basic scatter plot: wt on x-axis and mpg on y-axis; map cyl to col

ggplot(mtcars, aes(x = wt, y = mpg, col = factor(cyl))) +
  geom_point(size =4)

Hollow circles - an improvement

ggplot(mtcars, aes(x = wt, y = mpg, col = factor(cyl))) +
  geom_point(size =4,shape = 1)

Add transparency - very nice

ggplot(mtcars, aes(x = wt, y = mpg, col = factor(cyl))) +
  geom_point(size =4,alpha = 0.6) 

A closer look at overplotting

Scatter plot: carat (x), price (y), clarity (color)

ggplot(diamonds, aes(x = carat, y = price, color = clarity)) +
  geom_point()

Adjust for overplotting: copy the above functions and set the alpha to 0.5

ggplot(diamonds, aes(x = carat, y = price, color = clarity)) +
  geom_point(alpha = 0.5)

Scatter plot: clarity (x), carat (y), price (color)

ggplot(diamonds, aes(x = clarity, y = carat, color = price)) +
  geom_point(alpha = 0.5)

Dot plot with jittering

ggplot(diamonds, aes(x = clarity, y = carat, color = price)) +
  geom_point(alpha = 0.5,position = "jitter")

A new chapter: geometries

Three kinds are covered: scatter, bar, and line.

Scatter plots and jittering (1): the goal is to prevent overplotting

Shown in the viewer:

ggplot(mtcars, aes(x = cyl, y = wt)) +
  geom_point()

Solutions: 1 - With geom_jitter()

ggplot(mtcars, aes(x = cyl, y = wt)) +
  geom_jitter()

2 - Set width in geom_jitter()

ggplot(mtcars, aes(x = cyl, y = wt)) +
  geom_jitter(width = 0.1)

3 - Set position = position_jitter() in geom_point()

ggplot(mtcars, aes(x = cyl, y = wt)) +
  geom_point(position =  position_jitter(0.1))

jitter can be 1) an argument in geom_point(position = "jitter"), 2) a geom itself, geom_jitter(), or 3) a position function, position_jitter(0.1)
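Jitter is random, so the same command gives a slightly different picture on each run. A sketch of how to make it repeatable, assuming a ggplot2 version whose position_jitter() accepts a seed argument, is shown below; base_plot, jit, p_point and p_jitter are illustrative names.

```r
library(ggplot2)

base_plot <- ggplot(mtcars, aes(x = factor(cyl), y = wt))

# A fixed seed makes the random offsets repeatable; height = 0 leaves wt untouched
jit <- position_jitter(width = 0.1, height = 0, seed = 1)

# The position function works the same way inside either geom
p_point  <- base_plot + geom_point(position = jit)
p_jitter <- base_plot + geom_jitter(position = jit)
```

With the shared seed, both plots place every point at the same horizontal offset.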

Scatter plots and jittering (2)

The Vocab dataset contains information about the years of education and integer score on a vocabulary test for over 21,000 individuals based on US General Social Surveys from 1972-2004. The Vocab dataset is in the car package.

# install.packages("car")

Examine the structure of Vocab

library(car)
str(Vocab)
## 'data.frame':    21638 obs. of  4 variables:
##  $ year      : int  2004 2004 2004 2004 2004 2004 2004 2004 2004 2004 ...
##  $ sex       : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 2 1 2 2 1 ...
##  $ education : int  9 14 14 17 14 14 12 10 11 9 ...
##  $ vocabulary: int  3 6 9 8 1 7 6 6 5 1 ...

Basic scatter plot of vocabulary (y) against education (x). Use geom_point()

ggplot(Vocab,aes(x = education, y = vocabulary)) +
  geom_point()

Use geom_jitter() instead of geom_point()

ggplot(Vocab,aes(x = education, y = vocabulary)) +
  geom_jitter()

Using the above plotting command, set alpha to a very low 0.2

ggplot(Vocab,aes(x = education, y = vocabulary)) +
  geom_jitter(alpha = 0.2)

Using the above plotting command, set the shape to 1

ggplot(Vocab,aes(x = education, y = vocabulary)) +
  geom_jitter(alpha = 0.2,shape = 1)

On to bar plots

Histogram